Skip to content

Record: SLOT + LeakyReLU² + Legal Score-First TTT + Parallel Muon — val_bpb 1.1154 (3-seed mean) val_bpb = 1.1154 (3-seed mean, std 0.0002) | ~15.9 MB | 8×H100 SXM#1128

Open
AnubhavBharadwaaj wants to merge 5 commits intoopenai:mainfrom
AnubhavBharadwaaj:anubhav-slot-record
Open

Conversation

@AnubhavBharadwaaj
Copy link
Copy Markdown

@AnubhavBharadwaaj AnubhavBharadwaaj commented Mar 30, 2026

Record: SLOT + LeakyReLU² + Legal Score-First TTT + Parallel Muon — val_bpb 1.1154 (3-seed mean)

val_bpb = 1.1154 (3-seed mean, std 0.0002) | ~15.9 MB | 8×H100 SXM

3-Seed Results (8×H100 80GB SXM)

Seed step_avg steps Pre-TTT bpb Post-TTT+SLOT bpb TTT+SLOT time Artifact
1337 84.2ms 7,131 1.1381 1.1153 568s 15,997,676
42 84.1ms 7,133 1.1384 1.1156 568s 15,891,784
2025 83.9ms 7,151 1.1380 1.1153 571s 15,891,988
Mean 84.1ms 7,138 1.1382 1.1154 (std 0.0002) ~569s

vs Previous SOTA (PR #549)

Metric PR #549 This submission Delta
val_bpb (3-seed mean) 1.1194 1.1154 -0.0040
val_loss (3-seed mean) 1.8916 1.8833 -0.0083 nats
Significance (p < 0.01) Yes All 3 seeds individually beat SOTA
Record bar (≥0.005 nats) 0.0083 nats ✅ Cleared

Key Innovation: SLOT (Sample-specific LM Optimization at Test-time)

First SLOT-based entry in Parameter Golf. SLOT optimizes a single additive δ ∈ ℝ^512 vector at the last hidden layer during TTT scoring, adapting the model's hidden-to-logit mapping per-batch.

Source: Hu et al., arXiv:2505.12392v2, "SLOT: Sample-specific Language Model Optimization at Test-time" (Westlake University, 2025)

How SLOT Works

The model's forward_logits() is split into forward_hidden() + compute_logits(). During TTT Phase 1 (scoring), SLOT optimizes δ between the two:

for each batch of windows:
    # 1. Get hidden states from TTT-adapted model
    H = model.forward_hidden(x_batch)           # [bsz, seq_len, 512]

    # 2. Optimize delta (5 AdamW steps, lr=0.003)
    delta = zeros(1, 1, 512)                    # broadcasts across batch + seq
    optimizer = AdamW([delta], lr=0.003)
    for step in range(5):
        logits = model.compute_logits(H + delta)
        loss = CE(logits[:, :-1], targets[:, 1:])
        loss.backward()                          # gradients only through lm_head
        optimizer.step()

    # 3. Score with adapted logits
    final_logits = model.compute_logits(H + delta)
    nll = CE(final_logits, targets)              # used for BPB

Why SLOT Works

SLOT and TTT address complementary bottlenecks:

  • TTT adapts all 27M model weights to local data distribution (chunk-level, SGD, 3 epochs)
  • SLOT fine-tunes the final hidden→logit mapping per-batch (5 AdamW steps on 512 params)

TTT gives SLOT better hidden states; SLOT gives TTT-adapted representations a final per-batch correction. The two stack because they operate at different granularities (chunk vs batch) and different model depths (all layers vs last layer only).

SLOT Properties

SLOT Hyperparameters

Parameter Value Notes
Learning rate 0.003 Tuned up from paper's 0.001 default (our model is 27M vs paper's 7B)
Steps 5 Tuned up from paper's 3 default
Optimizer AdamW weight_decay=1e-8, eps=1e-5 (from paper)
Delta shape [1, 1, 512] Broadcasts across batch and sequence
Delta init zeros Matches paper

Hyperparameter Ablation (seed 1337)

SLOT Config BPB Delta vs baseline
Disabled (baseline) 1.1195
lr=0.001, steps=3 1.1188 -0.0007
lr=0.003, steps=5 1.1153 -0.0042

Also Tested: CTW — Negative Result

Context Tree Weighting (Willems et al., 1995) was integrated and tested across three progressively improved implementations. All degraded BPB.

CTW Version Change BPB Verdict
v1: Naive n-gram lookup Deepest-match KT estimate, fixed w=0.1 1.1252 +0.005 worse
v2: Proper recursive Full P_w = 0.5·P_e + 0.5·P_w_child + entropy gating Not tested (speed)
v3: Vectorized entropy gate Batch entropy, selective CTW loop Still worse Killed early

Root cause: The 11-layer transformer at 1.12 BPB already captures all n-gram patterns a depth-4 Markov model knows. Mixing in a weaker predictor adds noise regardless of implementation quality.

Also Tested: Stacking Hacks — Negative Results

Hack Mechanism BPB Verdict
Adaptive Temperature Optimize temp scalar per-batch via SGD 1.1325 +0.014 worse
Focal TTT Upweight hard tokens in Phase 2 via focal loss 1.1441 +0.025 worse

Base Architecture (PR #549 by @abaybektursun)

  • 11L, 512d, 8H/4KV, LeakyReLU(0.5)² MLP 3×
  • Parameter Banking + Parallel Muon (FlashAttention 3)
  • BigramHash(1536), XSA4, Partial RoPE(16), LN Scale, VE128
  • EMA(0.997) + Tight SWA(50), GPTQ-lite int6 + LZMA-6
  • Legal Score-First TTT (SGD, lr=0.002, 3 epochs, 32K chunks)

Run Command

cd /workspace/parameter-golf && SEED=1337 SLOT_ENABLED=1 SLOT_LR=0.003 SLOT_STEPS=5 \
CTW_WEIGHT=0 NUM_LAYERS=11 BIGRAM_VOCAB_SIZE=1536 XSA_LAST_N=4 \
EMA_ENABLED=1 EMA_DECAY=0.997 SWA_ENABLED=1 SWA_EVERY=50 \
ROPE_DIMS=16 LN_SCALE=1 LATE_QAT=1 LATE_QAT_THRESHOLD=0.15 \
VE_ENABLED=1 VE_DIM=128 VE_LAYERS=9,10 \
TTT_ENABLED=1 TTT_LR=0.002 TTT_EPOCHS=3 TTT_CHUNK_TOKENS=32768 \
TTT_FREEZE_BLOCKS=0 TTT_MOMENTUM=0.9 TTT_BATCH_SEQS=32 TTT_GRAD_CLIP=1.0 \
MUON_WD=0.04 ADAM_WD=0.04 \
MATRIX_LR=0.025 SCALAR_LR=0.025 TIED_EMBED_LR=0.035 \
MUON_MOMENTUM=0.99 MUON_MOMENTUM_WARMUP_START=0.92 \
MUON_MOMENTUM_WARMUP_STEPS=1500 WARMDOWN_ITERS=3500 \
ITERATIONS=9000 MAX_WALLCLOCK_SECONDS=600 EVAL_STRIDE=64 \
torchrun --standalone --nproc_per_node=8 train_gpt.py

Credits

@0hq or @valerio-oai
Hey @0hq, I've applied for the Development grant several times but no response yet. GitHub: AnubhavBharadwaaj. Could you help check the status?

…result on PR openai#549 stack

First SLOT (Sample-specific LM Optimization at Test-time) entry in Parameter Golf.
SLOT optimizes a delta vector at the last hidden layer inside the TTT scoring loop.

SLOT results (3-seed):
  seed 1337: 1.1188 BPB | seed 42: 1.1185 BPB | seed 2025: 1.1183 BPB
  mean: 1.1185 (std 0.0003) vs baseline 1.1193 — consistent -0.0008 improvement

Also documents CTW as a negative result across 3 implementation iterations:
  v1 (naive n-gram lookup): +0.005 worse, 46 min eval
  v2 (proper recursive weighting + entropy gating): not runnable in time budget
  v3 (vectorized entropy gate): still worse, killed early
  Root cause: signal redundancy — transformer already captures all n-gram patterns

Base: PR openai#549 by @abaybektursun (LeakyReLU² + Legal TTT + Parallel Muon)
…4 (3-seed mean)

First SLOT (Sample-specific LM Optimization at Test-time) entry in Parameter Golf.
Optimizes 512-dim delta vector at last hidden layer per-batch during TTT scoring.
AdamW lr=0.003, 5 steps. Splits forward_logits() into forward_hidden() + compute_logits().

3-seed results (8xH100 SXM):
  seed 1337: 1.1153 BPB | seed 42: 1.1156 BPB | seed 2025: 1.1153 BPB
  mean: 1.1154 (std 0.0002) | val_loss mean: 1.8833
  vs SOTA PR openai#549: -0.0083 nats (>0.005 required) ✅

Base: PR openai#549 by @abaybektursun
SLOT paper: Hu et al., arXiv:2505.12392v2
@AnubhavBharadwaaj AnubhavBharadwaaj changed the title Record: SLOT + LeakyReLU² + Legal Score-First TTT + Parallel Muon — val_bpb 1.1154 (3-seed mean) **val_bpb = 1.1154** (3-seed mean, std 0.0002) | ~15.9 MB | 8×H100 SXM Record: SLOT + LeakyReLU² + Legal Score-First TTT + Parallel Muon — val_bpb 1.1154 (3-seed mean) val_bpb = 1.1154 (3-seed mean, std 0.0002) | ~15.9 MB | 8×H100 SXM Mar 30, 2026
@dexhunter
Copy link
Copy Markdown

Hi @AnubhavBharadwaaj -- constructive observation about SLOT legality that might be worth considering.

After reviewing the organizer's enforcement pattern on Issue #677, I noticed that SLOT may fall under the same "adapt on validation before the reported eval pass" pattern that led to 33+ PR closures (valerio-oai, 2026-03-27):

  1. Condition 3 (Score before update): SLOT optimizes the delta using F.cross_entropy on target tokens (y_batch), then scores those same tokens with the optimized delta. The delta is the "runtime state" being updated using x_t before x_t is scored.

  2. Condition 1 (Causality): The delta has shape [1,1,512] and broadcasts across all positions. Since it's optimized over all positions in the batch, the prediction at position t is influenced by tokens at positions t+1, t+2, ..., which violates strict prefix-only dependence.

This differs from the legal score-first TTT in PR #549, where chunk N is scored first (under inference_mode()), then the model trains on chunk N for future chunks. SLOT adapts and scores the same tokens in the same batch.

No organizer has ruled on SLOT specifically, so this may be fine -- but I wanted to flag it so the community can discuss before multiple PRs build on this technique. An organizer clarification on Issue #677 or #1017 would help everyone.

(We had a SLOT-based submission at 1.1015 that we self-closed for this reason: PR #1172.)

sisegod added a commit to sisegod/parameter-golf that referenced this pull request Apr 8, 2026
After a careful audit of the transcript and the records/ directory, several
claims in the PR body were either fabricated or unverifiable. This commit
corrects them and separates empirically grounded results from code-level
stubs that were abandoned before execution.

Corrections:

1. SLOT origin and default values

   The PR body said 'PR openai#1176 introduced SLOT with default lr=0.003
   steps=5' and called our lr=0.1 steps=100 '33x too small'. Verified
   against the actual PR bodies on GitHub on 2026-04-08:

     PR openai#1128 (AnubhavBharadwaaj, opened 2026-03-30 09:43 UTC)
       SLOT_LR=0.003 SLOT_STEPS=5 (the actual origin + the defaults we
       meant to cite)

     PR openai#1176 (bigbag, opened 2026-03-31 09:45 UTC)
       SLOT_LR=0.005 SLOT_STEPS=8, QK-Gain=4.0, Muon-TTT
       (cites PR openai#1128 as its own SLOT reference)

   Fixed: SLOT origin now attributed to PR openai#1128, the lr=0.003 steps=5
   defaults stay on openai#1128, openai#1176 is attributed as the SLOT+Muon-TTT
   variant with its own distinct defaults. Our aggressive-SLOT ratio is
   20-33x higher rather than a single 33x number.

2. Shannon-floor numbers

   The PR body said 'rANS reaches 2.32 bits/weight on MLP-up vs a Shannon
   theoretical minimum of 2.28 bits/weight, the remaining 0.04 bits/weight
   is coding overhead'. The 2.28 number was fabricated.

   Actual measurement from running analyze_inter_layer.py (reported in
   the earlier session transcript):

     H(W_l) raw MLP-up Pentanary entropy, avg: 2.124 bits
     H(dW_l) inter-layer delta Pentanary entropy, avg: 2.128 bits
     delta_abs_mean / W_abs_mean ratio: ~1.4 (delta 40% larger than W)

   Fixed: replaced the fabricated 2.28 with the actual 2.124 / 2.128
   measurements, added the 1.4x magnitude ratio.

3. PR openai#1239 mis-reference in README

   README said 'Depth Recurrence (PR openai#1239 style)'. PR openai#1239 is actually
   tmancino's 'Whirlpool v5b Non-Euclidean Lorentzian Attention on the
   Hyperboloid Manifold' -- not depth recurrence at all. Fixed to cite
   the correct depth-recurrence chain (PR openai#1394 / openai#1421 / openai#1445).

4. Phase 1C ternary regression +0.014 -- FABRICATED

   The PR body claimed 'Phase 1C (Ternary BitNet b1.58 1-layer sanity):
   regression +0.014, abandoned'. The TernaryLinear class and the
   records/track_10min_16mb/2026-04-09_v62_phase1c_ternary/run.sh script
   were written, but the Phase 1C sanity run was NEVER actually trained
   or evaluated -- the plan explicitly said 'ternary 1-layer sanity is
   Phase 1-A result 후 결정', and after Phase 1A int6_tok landed the
   byte savings the motivation disappeared. The +0.014 number was
   invented.

   Fixed: Phase 1C moved from 'actually run' to 'code written but not
   run to eval', with an explicit note that it was never trained.

5. Phase 1B FP32 scalar Int8 '-0.05 MB only' -- NOT VERIFIED

   No measurement in the transcript. Fixed: Phase 1B moved to 'code
   written but not run', described as a stub only.

6. Phase 2B Hadamard / Phase 2C Context rANS / Phase 3 HQGRANS1 numbers

   Phase 2B 'no rANS gain' -- no measurement, planning note only.
   Phase 2C 'Rust codec rebuild blocker' -- true but never got to eval.
   Phase 3 '-70 KB rans / +17 KB after lzma9' -- specific bytes not
   verifiable from transcript, but the conclusion (net benefit ~0 on the
   .rans.ptz.xz path) is defensible from the lzma9-after-rANS
   architecture.

   Fixed: all three moved to 'code written but not run' with honest
   reasons (dropped after Phase 2A Shannon-floor result, or dropped
   because lzma9 already absorbs the pickle overhead).

7. 'Eleven completed-to-eval experiments' -- OVERCLAIM

   Only 10 experiments were actually run to eval, not 11. Fixed to '10
   actually-run experiments + 5 code-written stubs'.

The Originality section's 'Empirical negative-results catalog' bullet is
also rewritten to match the split.

What stays unchanged (verified):
  - Phase 1A int6_tok: +0.0006 regression, -0.61 MB xz (ACTUAL measurement)
  - Phase 1A pent_tok: +0.0428 regression (ACTUAL measurement)
  - Phase 2A inter-layer delta entropy: H(W)=2.124, H(dW)=2.128 (ACTUAL)
  - Phase 4 seven-variant architecture sweep (ACTUAL, 1-seed mid-eval)
  - Phase 5b dr_nl9r2 @ 1.151, dr_nl7r2 @ 1.166 (ACTUAL)
  - SLOT-100 3-seed @76% = 1.136399 (ACTUAL)
  - TTT 3-seed = 1.205215 (ACTUAL)
  - rANS codec originality + Pentanary MLP-up 2.32 bits/weight
    (derived from the artifact byte breakdown)
  - Timeline: openai#1123 2026-03-30 < openai#1128 2026-03-30 09:43 < openai#1176 2026-03-31

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants